A large number of seemingly diverse coefficients have been proposed as indices of dependability, or reliability, for domain-referenced and/or mastery tests. In this paper, it is shown that most of these indices are special cases of two generalized indices of agreement, one that is corrected for chance, and one that is not. The special cases of these two indices are determined by assumptions about the nature of the agreement function or, equivalently, the nature of the loss function for the testing procedure. For example, indices discussed by Huynh (1976), Subkoviak (1976), and Swaminathan, Hambleton, and Algina (1974) employ a threshold agreement, or loss, function; whereas indices discussed by Brennan and Kane (1977a, 1977b) and Livingston (1972a) employ a squared-error loss function. Since all of these indices are discussed within a single general framework, the differences among them in their assumptions, properties, and uses can be exhibited clearly. For purposes of comparison, norm-referenced generalizability coefficients are also developed and discussed within this general framework.

Introduction

Glaser and Nitko (1971) define a criterion-referenced test as "one that is deliberately constructed to yield measurements that are directly interpretable in terms of specified performance standards" (p. 653). This is probably the best-known definition of a criterion-referenced test, but others have been proposed (e.g., Ivens, 1970; Kriewall, 1969; and Livingston, 1972a). Nothing in the Glaser and Nitko definition, or in most other definitions of "criterion-referenced test," necessitates the existence or use of a single criterion or cutting score as a "specified performance standard." However, much of the literature subsumed under the heading of criterion-referenced measurement does, in fact, postulate the existence of a single cutting score. Since this inconsistency in terminology can lead to confusion, we prefer to reserve the term mastery test for a criterion-referenced test with a single fixed mastery cutting score.

Hively (1974) and Millman (1974), among others, suggest using the descriptor "domain-referenced test" rather than "criterion-referenced test." They note that the word "criterion" is ambiguous in some contexts, and they argue that the word "domain" provides a more direct specification of the entire set of items or tasks under consideration. If one accepts these arguments, a mastery test can be defined as a domain-referenced test with a single cutting score.

One can also distinguish between a particular type of test (e.g., norm-referenced or domain-referenced) and the scores (or interpretation of scores) resulting from a test. For example, the scores from any test might be given norm-referenced or domain-referenced interpretations. Indeed, most of the literature that treats issues of dependability (or reliability) of domain-referenced tests actually treats the dependability of a particular set of scores that are given (or provide) a domain-referenced or mastery interpretation. In this paper, to obviate verbal complexity, we will often refer to norm-referenced, domain-referenced, and mastery "tests"; however, a more complete verbal description would refer to scores that are given (or provide) norm-referenced, domain-referenced, or mastery interpretations for a particular testing procedure.

Since Popham and Husek (1969) challenged the appropriateness of correlation coefficients as indices of reliability for domain-referenced and mastery tests, considerable effort has been devoted to developing more appropriate indices. Most of these indices have been proposed as measures of reliability; however, we prefer to use the more generic term, dependability, in order to avoid unwarranted associations with the classical theory of reliability for norm-referenced tests.

Since a large number of seemingly diverse coefficients have been proposed, it has been difficult for evaluators to distinguish among them in meaningful ways. In this paper, we show that most of these indices can be classified into two broad categories depending on their underlying (and sometimes unstated) assumptions about the nature of agreement or, equivalently, the nature of loss in the testing procedure. For example, indices discussed by Carver (1970), Huynh (1976), Marshall and Haertel (Note 1), Subkoviak (1976), and Swaminathan, Hambleton, and Algina (1974) employ a threshold agreement, or loss, function; whereas indices discussed by Brennan and Kane (1977a, 1977b) and Livingston (1972a, 1972b, 1972c, 1973) employ a squared-error loss function. We will also show that within these two broad categories indices can be differentiated with respect to whether or not they incorporate a correction for chance agreement. In addition, we will examine both the nature of agreement and the role of chance agreement in norm-referenced testing.

We begin by using notions of agreement in order to develop two generalized indices of dependability. One of these indices is corrected for chance agreement, and the other is not. For both of these general indices no specific agreement function is assumed. The actual indices that result from several specific agreement functions are then examined in detail. This examination of a large number of indices of dependability, within a single consistent framework, makes it possible to compare and contrast the assumptions, properties, interpretations, and uses of these indices.






Indices of Dependability



A Pair of General Indices of Dependability



Agreement Function

A general expression for the dependability of a testing procedure can be derived by examining the expected agreement between two randomly selected instances of the testing procedure. Any particular instance of a testing procedure will be referred to as a "test." No assumptions need to be made about the nature of the tests, the details of their administration, or their scoring. Since the instances, or tests, are randomly selected from a universe of possible instances, they are randomly parallel. Therefore, the expected distribution of outcomes for the population is assumed to be the same for all instances of the testing procedure. This does not imply that the distributions of outcomes are necessarily identical for all tests; that is, we are not making the stronger assumption of classically parallel tests.

The degree of agreement between any two scores, $s_i$ and $s_j$, is defined by an agreement function, $a(s_i, s_j)$. The scores, $s_i$ and $s_j$, may be raw scores, or they may be transformed in some way. For convenience, we shall assume that only a finite number of scores ($s_0, \ldots, s_n$) may result from the use of the testing procedure. The form of the agreement function defines what is meant by agreement in any particular context. In general, the agreement function will reflect intuitive and, therefore, somewhat arbitrary notions of the relative degree of agreement for different pairs of scores. As we shall see, the choice of an agreement function implies the choice of a loss function, where loss is defined as the difference between maximum possible agreement and observed agreement in a particular context.

Although we shall not assume any particular form for an agreement function in our development of a general index of dependability, it is reasonable to impose some conditions on the class of functions that will be accepted as agreement functions. In the discussion that follows, it is assumed that all agreement functions satisfy the following three conditions:



$$
\begin{aligned}
&\text{(i)}\quad a(s_i, s_i) \ge 0; \\
&\text{(ii)}\quad a(s_i, s_j) = a(s_j, s_i); \text{ and} \\
&\text{(iii)}\quad a(s_i, s_i) + a(s_j, s_j) \ge 2\,a(s_i, s_j).
\end{aligned}
\tag{1}
$$



Given that we are examining the agreement between randomly parallel tests, the first two of these conditions are certainly natural. The third condition simply states that the agreement assigned to any pair of scores, $s_i$ and $s_j$, cannot be greater than the average of the agreements that result from pairing each of these scores with itself. All the agreement functions discussed in this paper satisfy these three conditions.
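As an illustration only, the three conditions in Equation 1 can be checked mechanically once an agreement function is tabulated as a square matrix with entries $a(s_i, s_j)$. The sketch below is not part of the original development; the function name, the tolerance, and the example matrices are assumptions introduced here.

```python
def satisfies_agreement_conditions(a, tol=1e-12):
    """Return True if the matrix a[i][j] = a(s_i, s_j) meets conditions (i)-(iii)."""
    n = len(a)
    for i in range(n):
        # (i) self-agreement is non-negative: a(s_i, s_i) >= 0
        if a[i][i] < -tol:
            return False
        for j in range(n):
            # (ii) symmetry: a(s_i, s_j) = a(s_j, s_i)
            if abs(a[i][j] - a[j][i]) > tol:
                return False
            # (iii) a(s_i, s_i) + a(s_j, s_j) >= 2 a(s_i, s_j)
            if a[i][i] + a[j][j] < 2 * a[i][j] - tol:
                return False
    return True


# Hypothetical example matrices (not taken from the paper).
threshold = [[1, 0], [0, 1]]          # 1 on exact agreement, 0 otherwise
scores, cut = [0.0, 0.5, 1.0], 0.8    # illustrative scores and cutting score
product = [[(si - cut) * (sj - cut) for sj in scores] for si in scores]
too_generous = [[1, 2], [2, 1]]       # off-diagonal exceeds the diagonal average

print(satisfies_agreement_conditions(threshold))     # True
print(satisfies_agreement_conditions(product))       # True
print(satisfies_agreement_conditions(too_generous))  # False
```

The product-of-deviations matrix passes condition (iii) because the difference between the two sides is a squared quantity; it can still take negative values off the diagonal, which condition (i) does not forbid.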

Maximum Agreement and the Index $\theta$

The score for person v on the k-th instance of the testing procedure can be represented by the random variable $S_{vk}$. Similarly, $S_{wl}$ is the score for person w on test l. For every person v and every test k, $S_{vk}$ takes one of the values $s_0, \ldots, s_n$. We might, then, take as our index of dependability the expected agreement given by:

$$A = \mathcal{E}\, a(S_{vk}, S_{vl}) \tag{2}$$

where the expectation is taken over the population of persons and over pairs of tests that are independently sampled from the universe of tests. The expected agreement may also be represented in terms of the joint distribution of scores on the two tests:

$$A = \sum_{i,j=0}^{n} a(s_i, s_j)\,\Pr(S_{vk} = s_i,\ S_{vl} = s_j) \tag{3}$$

where $\Pr(S_{vk} = s_i,\ S_{vl} = s_j)$ is the probability that a randomly chosen person, v, gets scores $s_i$ and $s_j$ on randomly chosen tests, k and l. Equations 2 and 3 represent the same quantity expressed in two different ways. In the following discussion, we shall use whichever of these expressions is most convenient for the issue under consideration. The notation in Equation 3 can be simplified by letting

$$a_{ij} = a(s_i, s_j)$$

and

$$p_{ij} = \Pr(S_{vk} = s_i,\ S_{vl} = s_j).$$

Equation 3 can then be written as

$$A = \sum_{i,j=0}^{n} a_{ij}\,p_{ij}. \tag{4}$$



However, the index, A, depends on the scale chosen for $a_{ij}$, and can be made arbitrarily large by multiplying $a_{ij}$ by a sufficiently large constant. One way to correct this problem is to take as the index of agreement:

$$\theta = \frac{A}{A_m}. \tag{5}$$



In Equation 5, $A_m$ is the expected agreement between an instance of the testing procedure and itself:

$$A_m = \mathcal{E}\, a(S_{vk}, S_{vk}) = \sum_{i=0}^{n} a(s_i, s_i)\,\Pr(S_{vk} = s_i); \tag{6}$$

and in the simpler notation,

$$A_m = \sum_{i=0}^{n} a_{ii}\,p_i, \tag{7}$$




where $p_i$ is the probability that a randomly selected person will get the score, $s_i$, on a randomly chosen instance of the testing procedure. A is equal to $A_m$ when every person in the population gets the same score on every instance of the testing procedure; i.e., when all instances of the testing procedure are in perfect agreement in the assignment of scores, $s_i$, to persons in the population. Using the three conditions in Equation 1, it is easy to show that for any marginal distribution, $A_m$ is the maximum value of A.

Since $p_i$ is a marginal probability,

$$A_m = \sum_{i} a_{ii}\,p_i = \sum_{i,j} a_{ii}\,p_{ij},$$

and

$$A_m = \sum_{j} a_{jj}\,p_j = \sum_{i,j} a_{jj}\,p_{ij}.$$

Therefore, $A_m$ can be written as

$$A_m = \sum_{i,j} \left[\frac{a_{ii} + a_{jj}}{2}\right] p_{ij}.$$

Now, using Assumption iii in Equation 1,

$$A_m \ge \sum_{i,j} a_{ij}\,p_{ij};$$

and using Equation 4,

$$A_m \ge A.$$

From the definition of $\theta$ in Equation 5, therefore, it follows that $A_m$ is the maximum value of the expected agreement, and $\theta$ is less than or equal to one.
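A small numerical sketch may make the role of $A_m$ concrete. It assumes a joint table `p[i][j]` of score-pair probabilities with identical row and column marginals (as randomly parallel tests imply) and an agreement matrix `a[i][j]`; the function names and the numbers are hypothetical, chosen only for illustration.

```python
def marginal(p):
    """Marginal score probabilities p_i obtained by summing each row of p[i][j]."""
    return [sum(row) for row in p]


def expected_agreement(a, p):
    """A = sum over i, j of a_ij * p_ij (Equation 4)."""
    k = len(p)
    return sum(a[i][j] * p[i][j] for i in range(k) for j in range(k))


def maximum_agreement(a, p):
    """A_m = sum over i of a_ii * p_i, the agreement of a test with itself."""
    return sum(a[i][i] * p_i for i, p_i in enumerate(marginal(p)))


def theta(a, p):
    """theta = A / A_m (Equation 5)."""
    return expected_agreement(a, p) / maximum_agreement(a, p)


# Hypothetical two-score example with symmetric joint probabilities.
p = [[0.6, 0.1],
     [0.1, 0.2]]
a = [[1, 0],
     [0, 1]]
a_rescaled = [[10 * x for x in row] for row in a]

print(round(theta(a, p), 3))           # 0.8
print(round(theta(a_rescaled, p), 3))  # 0.8
```

Rescaling the agreement function multiplies A and $A_m$ by the same constant, so the ratio is unchanged, which is the reason for dividing by $A_m$ in the first place.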



Chance Agreement and the Index $\theta_c$

The coefficient $\theta$ provides a general index of dependability for any agreement function, but it does not consider the contribution of chance agreement to the dependability of measurement. As we shall see, $\theta$ may be large even when scores are randomly assigned to persons on each instance of the testing procedure. When we say that a score is assigned to examinee v, by chance, we mean that the score is randomly selected from the distribution of scores for the population of persons on the universe of tests. The assignment of $s_i$ to examinee v, by chance, depends only on the marginal probability, $p_i$, of the score $s_i$, and not on the examinee's performance. Therefore, for chance assignment, the score assigned to an examinee on any particular instance of the testing procedure is independent of the score assigned on any other instance.

The contribution of chance agreement can be examined by taking the expected agreement between the score, $S_{vk}$, for person v on the k-th test, and the score, $S_{wl}$, for an independently sampled person, w, on an independently sampled test, l:

$$A_c = \mathcal{E}\, a(S_{vk}, S_{wl}) \tag{8}$$

or

$$A_c = \sum_{i,j=0}^{n} a(s_i, s_j)\,\Pr(S_{vk} = s_i,\ S_{wl} = s_j). \tag{9}$$

Since both persons, v and w, and tests, k and l, are sampled independently,

$$\Pr(S_{vk} = s_i,\ S_{wl} = s_j) = \Pr(S_{vk} = s_i)\,\Pr(S_{wl} = s_j) \tag{10}$$

where $\Pr(S_{vk} = s_i)$ is the marginal probability that a randomly selected person will get the score, $s_i$, on a randomly chosen test. Substituting Equation 10 in Equation 9, and using the simplified notation introduced earlier, we have

$$A_c = \sum_{i,j=0}^{n} a_{ij}\,p_i\,p_j. \tag{11}$$

For any agreement function, Equation 11 depends only on the marginal distribution for a single administration of the testing procedure. $A_c$ is the expected agreement for pairs of scores when each score is independently sampled from the marginal distribution of the population.

A general index of dependability, corrected for chance, can then be defined as:

$$\theta_c = \frac{A - A_c}{A_m - A_c}. \tag{12}$$

The numerator of Equation 12 provides a measure of how much the expected agreement for the testing procedure exceeds the expected agreement due to chance. Since

$$A_m \ge A,$$

the denominator of Equation 12 is the maximum value of the numerator and $\theta_c$ is less than or equal to one.
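The chance correction can be sketched in the same style. The function below assumes, as before, a symmetric joint table `p[i][j]` and an agreement matrix `a[i][j]`; the name `theta_c` and both example tables are assumptions made for illustration, and the `nan` return for a zero denominator is a convention chosen here, not something the text prescribes.

```python
import math


def theta_c(a, p):
    """Chance-corrected index (A - A_c) / (A_m - A_c) from a joint table p[i][j]."""
    k = len(p)
    m = [sum(row) for row in p]  # marginal probabilities p_i
    A = sum(a[i][j] * p[i][j] for i in range(k) for j in range(k))
    A_m = sum(a[i][i] * m[i] for i in range(k))
    A_c = sum(a[i][j] * m[i] * m[j] for i in range(k) for j in range(k))
    if math.isclose(A_m, A_c):
        return float("nan")  # 0/0: no room for improvement over chance
    return (A - A_c) / (A_m - A_c)


a = [[1, 0], [0, 1]]

# Hypothetical table in which scores on the two tests are related.
related = [[0.6, 0.1],
           [0.1, 0.2]]

# Hypothetical table built as p_ij = p_i * p_j with marginals 0.7 and 0.3,
# i.e. scores assigned purely by chance.
by_chance = [[0.49, 0.21],
             [0.21, 0.09]]

print(round(theta_c(a, related), 3))  # 0.524
print(abs(theta_c(a, by_chance)) < 1e-9)  # True
```

In the second table the uncorrected ratio $A/A_m$ is still about 0.58, while the corrected index is essentially zero; that gap is the contribution of chance agreement.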

Loss

Although most of the discussion in this paper is concerned with agreement functions, it will be useful in some places to discuss the expected disagreement, or loss, associated with testing procedures. The expected loss, L, for any testing procedure is defined as the difference between the maximum expected agreement and the expected agreement:

$$L = A_m - A. \tag{13}$$

Using this definition, Equation 5 can be written:

$$\theta = \frac{A}{A + L}, \tag{14}$$

and Equation 12 can be written:

$$\theta_c = \frac{A - A_c}{(A - A_c) + L}. \tag{15}$$

Note that Equations 14 and 15 have the form of classical reliability coefficients, or generalizability coefficients, with L taking the place of error variance (see Cronbach, Gleser, Nanda, and Rajaratnam, 1972).
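For readers who want the intermediate step, the rewriting in terms of loss is a direct substitution. The sketch below assumes that the uncorrected index is $A/A_m$ and that the corrected index has numerator $A - A_c$ and denominator $A_m - A_c$, which is how the surrounding definitions read.

```latex
\begin{aligned}
L = A_m - A \;&\Longrightarrow\; A_m = A + L, \\[4pt]
\theta = \frac{A}{A_m} &= \frac{A}{A + L}, \\[4pt]
\theta_c = \frac{A - A_c}{A_m - A_c}
         &= \frac{A - A_c}{(A + L) - A_c}
          = \frac{A - A_c}{(A - A_c) + L}.
\end{aligned}
```

In both forms the index equals one exactly when the loss is zero, and it falls as the loss grows relative to the (chance-corrected) agreement.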

Interpretation of $\theta$ and $\theta_c$

The two indices, $\theta$ and $\theta_c$, address different questions about dependability. Some of the properties of these indices will be discussed more fully later in the context of particular agreement functions, but a brief statement is appropriate here.

$\theta$ indicates how closely, in terms of the agreement function, the scores for any examinee can be expected to agree. $\theta_c$ indicates how closely (again, in terms of the agreement function) the two scores for an examinee can be expected to agree, with the contribution of chance agreement removed. For the agreement functions discussed in this paper, $\theta_c$ is less than or equal to $\theta$.

The index, $\theta$, therefore, characterizes the dependability of decisions, or estimates, based on the testing procedure. The magnitude of $\theta$ depends, in part, on chance agreement; it may be greater than zero even when decisions based on the testing procedure are no more dependable than decisions based on marginal probabilities in the population. The index, $\theta_c$, characterizes the contribution of the testing procedure to the dependability of the decisions, over what would be expected on the basis of chance agreement. $\theta$ provides an estimate of the dependability of the decisions based on the testing procedure; $\theta_c$ provides an estimate of the contribution of the testing procedure to the dependability of such decisions. The two indices provide answers to different questions. The issue is not which of these indices is best, but rather which is appropriate in a given context.



Threshold Agreement

Threshold Agreement Function

One common use of tests is to classify examinees into two or more mutually exclusive categories. If there are only two categories, or if the categories are unordered, then a plausible agreement function for the classification procedure is given by the threshold agreement function, t:

$$
t(S_{vk}, S_{wl}) =
\begin{cases}
1 & \text{if } S_{vk} = S_{wl} \\
0 & \text{if } S_{vk} \ne S_{wl}
\end{cases}
\tag{16}
$$

where $S_{vk}$ is the score (in this case the category) for examinee v on the test k. Equation 16 can be expressed more succinctly as:

$$
t_{ij} =
\begin{cases}
1 & \text{if } s_i = s_j \\
0 & \text{if } s_i \ne s_j
\end{cases}
\tag{17}
$$

where the score $s_i$ represents assignment to the i-th category. Equation 17 has the advantage of notational simplicity, whereas Equation 16 is a more detailed statement of the threshold agreement function, t. For either expression, the assigned agreement is one if the examinee is placed in the same category on both administrations of the procedure, and agreement is zero if the examinee is placed in different categories on the two administrations. It is easily verified that the agreement function in Equation 16 satisfies the three conditions in Equation 1.

The Index $\theta(t)$

Substituting $t_{ij}$, given by Equation 17, for $a_{ij}$ in Equation 4, we obtain the expected agreement for classification procedures:

$$A(t) = \sum_{i,j} t_{ij}\,p_{ij} = \sum_{i} p_{ii}. \tag{18}$$

The maximum agreement is given by:

$$A_m(t) = \sum_{i} t_{ii}\,p_i = 1. \tag{19}$$

The definition of $\theta$, an index of dependability not corrected for chance, is provided in Equation 5. For the threshold agreement function, this index is given by

$$\theta(t) = \frac{A(t)}{A_m(t)}, \tag{20}$$

and substituting Equations 18 and 19 in Equation 20, we obtain

$$\theta(t) = \sum_{i=0}^{n} p_{ii}. \tag{21}$$

Equation 21 states that the dependability of the classification procedure is simply the probability that a randomly chosen examinee will be placed in the same category on two randomly chosen instances of the procedure. Note that $A(t)$, $A_m(t)$, and $\theta(t)$ are all equal to one if the classification procedure consistently places all examinees into a single category.

Equation 21 is stated in terms of population parameters. Estimates of $\theta(t)$, based on two administrations of the testing procedure, have been discussed by Berger (Note 2), Carver (1970), and Swaminathan et al. (1974). Estimates of $\theta(t)$, based on a single administration of the testing procedure, have been discussed by Marshall and Haertel (Note 1), Subkoviak (1976, Note 3), and Subkoviak and Albrecht (Note 4).

The Index $\theta_c(t)$

The definition of $\theta_c$, an index of dependability corrected for chance, is provided in Equation 12. For the threshold agreement function in Equation 17, the expected agreement due to chance is

$$A_c(t) = \sum_{i,j} t_{ij}\,p_i\,p_j = \sum_{i} p_i^2. \tag{22}$$

Subtracting $A_c(t)$ from the numerator and denominator of $\theta(t)$ in Equation 20, we have the index of dependability corrected for chance, for a threshold agreement function:

$$\theta_c(t) = \frac{\sum_i p_{ii} - \sum_i p_i^2}{1 - \sum_i p_i^2}. \tag{23}$$

In the special case where all examinees are consistently placed in a single category, $A_c(t)$ is equal to one and $\theta_c(t)$ is indeterminate.

The index $\theta_c(t)$ in Equation 23 is identical to Cohen's (1960) coefficient kappa, and to Scott's (1955) coefficient, under our assumption that the expected marginal distributions for the two instances of the testing procedure are identical. As such, $\theta_c(t)$ has been proposed by Huynh (1976) and Swaminathan et al. (1974) as an index of reliability for mastery tests with a single cutting score.
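The two threshold indices reduce to simple sums over a table of category-pair proportions. The sketch below works from population-style proportions `p[i][j]` with identical row and column marginals; it is not one of the estimators cited above, and the function name, the `None` return for the indeterminate case, and the example tables are assumptions introduced for illustration.

```python
def threshold_indices(p):
    """Return (theta_t, theta_c_t) for a joint table p[i][j] of category proportions.

    theta_t   = sum_i p_ii                       (A(t), since A_m(t) = 1)
    theta_c_t = (sum_i p_ii - sum_i p_i^2) / (1 - sum_i p_i^2)
    theta_c_t is None when everyone falls in one category (0/0).
    """
    k = len(p)
    m = [sum(row) for row in p]              # marginal proportions p_i
    agree = sum(p[i][i] for i in range(k))   # proportion classified consistently
    chance = sum(x * x for x in m)           # chance agreement A_c(t)
    if abs(1.0 - chance) < 1e-12:
        return agree, None
    return agree, (agree - chance) / (1.0 - chance)


# Hypothetical mastery / non-mastery table.
table = [[0.6, 0.1],
         [0.1, 0.2]]
theta_t, theta_c_t = threshold_indices(table)
print(round(theta_t, 3), round(theta_c_t, 3))  # 0.8 0.524

# Everyone consistently placed in a single category.
print(threshold_indices([[1.0, 0.0], [0.0, 0.0]]))  # (1.0, None)
```

With sample data from two administrations, the same arithmetic applied to observed proportions gives a plug-in value; whether that value is a good estimate is a separate question addressed by the cited literature.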

Threshold Loss

The loss associated with a threshold agreement function can be determined by subtracting Equation 18 from Equation 19:

$$L(t) = A_m(t) - A(t) = 1 - \sum_{i} p_{ii}.$$

If the two instances of the testing procedure assign the person to the same category, the loss is zero. If the two instances assign a person to different categories, the loss is one, regardless of which categories are involved. This is consistent with the usual definition of a threshold loss function (see Hambleton and Novick, 1973).

Interpretation of $\theta(t)$ and $\theta_c(t)$

The first block of Table 1 summarizes results for the parameters, $A(t)$, $A_m(t)$, $A_c(t)$, and $L(t)$, and the agreement indices, $\theta(t)$ and $\theta_c(t)$, for the threshold agreement function, t.

Insert Table 1 about here



As noted earlier, $\theta(t)$ will be equal to one whenever all instances of the categorization procedure place all persons into one category. The testing procedure used to assign persons to categories is then perfectly dependable and completely superfluous. Once it is established that all, or almost all, persons fall into one category, there is little to be gained by administering tests.

If almost everyone is in one category, the expected chance agreement, $A_c(t)$, will be close to $A_m(t)$, the maximum expected agreement. Under these circumstances, it would be difficult for any testing procedure to provide a significant improvement in dependability over chance assignment. Consequently, the coefficient corrected for chance, $\theta_c(t)$, will tend to be small whenever the testing procedure places almost everyone in the same category.

Therefore, $\theta_c(t)$ is liable to one of the objections raised by Popham and Husek (1969) against classical reliability coefficients as indices for mastery tests, namely, $\theta_c(t)$ may be close to zero even when individuals are consistently placed in the correct category. However, this property of $\theta_c(t)$ does not point to any basic flaw in the coefficient, but only to a possible misinterpretation of the coefficient. A low value of $\theta_c(t)$ does not necessarily indicate that assignments to categories are inconsistent from one administration to the next. Rather, a low value of $\theta_c(t)$ indicates that the use of the testing procedure in classifying individuals is not much more dependable than a process of random assignment based on prior information about the population (i.e., the marginals in the population). Note that $\theta(t)$ is large whenever the classification of examinees is consistent from one instance of the testing procedure to another; therefore, $\theta(t)$ is not subject to Popham and Husek's objection.
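The contrast between the two indices is easy to see numerically. The sketch below compares two hypothetical tables of category-pair proportions in which 96% of examinees are classified the same way on both occasions; in one table nearly everyone is a master, in the other the population is evenly split. The tables and the helper name are invented for this illustration.

```python
def theta_t_and_theta_c_t(p):
    """Return (theta(t), theta_c(t)) for a joint table p[i][j] with equal marginals."""
    m = [sum(row) for row in p]                    # marginal proportions p_i
    agree = sum(p[i][i] for i in range(len(p)))    # sum of p_ii
    chance = sum(x * x for x in m)                 # sum of p_i squared
    return agree, (agree - chance) / (1 - chance)


# Hypothetical tables: same consistency, very different marginals.
skewed = [[0.96, 0.02],
          [0.02, 0.00]]     # 98% of examinees in the first category
balanced = [[0.48, 0.02],
            [0.02, 0.48]]   # categories equally likely

for name, table in (("skewed", skewed), ("balanced", balanced)):
    t, tc = theta_t_and_theta_c_t(table)
    print(name, round(t, 2), round(tc, 2))
# skewed 0.96 -0.02
# balanced 0.96 0.92
```

Both procedures are equally consistent in the sense of $\theta(t)$, but only in the balanced case does the test add much beyond what the marginal proportions already provide.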

Contrary to a suggestion by Subkoviak (1976), the two coefficients developed from the threshold agreement function are not appropriate when there are more than two categories, and these categories are ordered in some way. The threshold agreement function in Equation 16 assumes that the categories are not ordered in any way.



Domain-Referenced Agreement



In our discussion of domain-referenced agreement, we shall assume that, for each instance of the testing procedure, a random sample of n items is drawn from some infinite domain (or universe) of items, and the sample of items is administered to all examinees.

In the last section we used a threshold loss function to examine the dependability of procedures that assign each examinee to one of a set of qualitative categories. In this section, we shall examine the dependability of domain-referenced testing procedures. We shall emphasize the use of such procedures for mastery decisions with a single cutting score, but we shall also discuss the use of domain-referenced tests in the absence of a specified cutting score.

The score for person v on item i can be represented by a general linear model:

$$X_{vi} = \mu + \pi_v + \beta_i + (\pi\beta,e)_{vi} \tag{24}$$

where

$X_{vi}$ = observed score for person v on item i;

$\mu$ = grand mean in the population of persons and the universe of items;

$\pi_v$ = effect for person v;

$\beta_i$ = effect for item i;

$(\pi\beta,e)_{vi}$ = effect for the interaction of person v and item i, which is confounded with residual error;

and all effects are assumed to be independent random effects. In the usual case, where each examinee responds once to each item, the interaction effect and residual error are completely confounded and, therefore, these two effects are combined in Equation 24.

In the discussion that follows, the observed score for person v will be taken to be the mean score over the sample of n items. To be consistent with our earlier notation, we will let the subscript I indicate a particular sample of n items, and we will designate a person's observed mean score as:

$$S_{vI} = \mu + \pi_v + \beta_I + (\pi\beta,e)_{vI}. \tag{25}$$

Similarly, the score for person w on the J-th sample of n items is

$$S_{wJ} = \mu + \pi_w + \beta_J + (\pi\beta,e)_{wJ}. \tag{26}$$

Note that $S_{vI}$ and $S_{wJ}$ are observed scores; they are not the same as $S_{vk}$ and $S_{wl}$ used previously to denote categories to which persons are assigned.

Domain-Referenced Agreement Function

One source of difficulty with $\theta(t)$ and $\theta_c(t)$ for mastery testing is the nature of the threshold agreement function. When mastery testing is used to make placement decisions, errors may involve very different degrees of loss. If a mastery test consisting of a sample from a universe of spelling words has a cut-off of 80%, the consequences of misclassifying a student with a universe score of 79% are likely to be far less serious than the consequences of misclassifying a student with a universe score of 40%. A threshold loss function assigns the same loss to both of these cases.

This suggests that the agreement function for domain-referenced tests that are used for mastery decisions should involve the distance of the observed score from the cutting score. For a cutting score, $\lambda$, the domain-referenced agreement function is defined by:

$$d(S_{vI}, S_{wJ}) = (S_{vI} - \lambda)(S_{wJ} - \lambda), \tag{27}$$

where I and J refer to independent samples of n items. Equation 27 assigns a positive agreement to two scores that result in the same classification, mastery or non-mastery. It assigns a negative agreement to two scores that result in different classifications. In either case, the magnitude of the agreement depends on the magnitudes of two deviation scores, $(S_{vI} - \lambda)$ and $(S_{wJ} - \lambda)$. If both of these deviation scores are close to zero, indicating a "borderline case," the magnitude of the agreement function will be close to zero. If both of these deviation scores are large and in the same direction, indicating strong agreement, the domain-referenced agreement function will be large and positive. If both deviations are large and in opposite directions, indicating strong disagreement, the domain-referenced agreement function will be large and negative.
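To make the contrast with threshold agreement concrete, the sketch below evaluates both functions for a few pairs of proportion-correct scores with a cutting score of 0.80. The score pairs are hypothetical, and treating a score at or above the cutting score as "mastery" is a convention assumed here for the threshold comparison.

```python
def domain_referenced_agreement(s1, s2, cut):
    """d = (s1 - cut) * (s2 - cut): product of the two deviations from the cutting score."""
    return (s1 - cut) * (s2 - cut)


def threshold_agreement(s1, s2, cut):
    """1 if both scores fall on the same side of the cutting score, else 0.

    Assumes mastery means score >= cut (an illustrative convention).
    """
    return 1 if (s1 >= cut) == (s2 >= cut) else 0


cut = 0.80
pairs = {
    "borderline disagreement": (0.79, 0.81),
    "clear disagreement": (0.40, 0.90),
    "clear agreement": (0.95, 0.90),
}
for label, (s1, s2) in pairs.items():
    d = domain_referenced_agreement(s1, s2, cut)
    t = threshold_agreement(s1, s2, cut)
    print(label, round(d, 4), t)
# borderline disagreement -0.0001 0
# clear disagreement -0.04 0
# clear agreement 0.015 1
```

The threshold function treats the first two pairs identically, whereas the domain-referenced function penalizes the clear disagreement several hundred times more heavily than the borderline one.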



The domain-referenced agreement function in Equation 27 is similar to the definition of agreement used by Livingston (1972a) in developing an index of reliability for mastery tests. However, Livingston assumed that the two tests were parallel in the sense of classical test theory. We base our analysis on generalizability theory, which makes the weaker assumption that the tests are randomly parallel. As a result, the indices derived here differ from Livingston's coefficient in several significant ways.

The Index $\theta(d)$

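Using the domain-referenced agreement function in Equation 27 and the definition of expected agreement in Equation 2, we obtain

$$A(d) = \mathcal{E}\,(S_{vI} - \lambda)(S_{vJ} - \lambda). \tag{28}$$

Now, using Equation 25 to replace $S_{vI}$ and $S_{vJ}$ in Equation 28,

$$A(d) = \mathcal{E}\,\big[(\mu - \lambda) + \pi_v + \beta_I + (\pi\beta,e)_{vI}\big]\big[(\mu - \lambda) + \pi_v + \beta_J + (\pi\beta,e)_{vJ}\big]. \tag{29}$$

Since the effects $\pi$, $\beta$, and $(\pi\beta,e)$ are assumed to be sampled independently, and $\mu$ and $\lambda$ are constants, the expected values of the cross-products are zero, and Equation 29 reduces to

$$A(d) = (\mu - \lambda)^2 + \mathcal{E}\,\pi_v^2 + \mathcal{E}\,\beta_I\beta_J + \mathcal{E}\,(\pi\beta,e)_{vI}(\pi\beta,e)_{vJ}. \tag{30}$$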

Because the two sets of items are independently sampled, the last two terms in Equation 30 equal zero. Also, by the definition of a variance component, $\sigma^2(\pi) = \mathcal{E}\,\pi_v^2$, and, therefore, the expected agreement, for the domain-referenced agreement function, is

$$A(d) = (\mu - \lambda)^2 + \sigma^2(\pi). \tag{31}$$
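Equation 31 can be checked by simulation. The sketch below draws a person effect once and two independent item samples per replicate, then averages the product of deviations from the cutting score. Normally distributed effects, the particular variance components, and the shortcut of drawing the mean of n item-level effects directly with variance $\sigma^2/n$ are all assumptions made for this illustration; the derivation itself requires only independence.

```python
import random


def simulate_expected_agreement(mu, cut, var_p, var_i, var_pie, n,
                                reps=100_000, seed=0):
    """Monte Carlo estimate of E[(S_vI - cut)(S_vJ - cut)] for the same person v."""
    rng = random.Random(seed)
    sd_p = var_p ** 0.5
    sd_item_mean = (var_i / n) ** 0.5      # sd of the mean item effect over n items
    sd_resid_mean = (var_pie / n) ** 0.5   # sd of the mean interaction/residual effect
    total = 0.0
    for _ in range(reps):
        person = rng.gauss(0.0, sd_p)      # shared by both tests
        s_I = mu + person + rng.gauss(0.0, sd_item_mean) + rng.gauss(0.0, sd_resid_mean)
        s_J = mu + person + rng.gauss(0.0, sd_item_mean) + rng.gauss(0.0, sd_resid_mean)
        total += (s_I - cut) * (s_J - cut)
    return total / reps


# Hypothetical components: mean 0.70, cutting score 0.80, 20 items per test.
estimate = simulate_expected_agreement(mu=0.70, cut=0.80, var_p=0.01,
                                       var_i=0.02, var_pie=0.18, n=20)
predicted = (0.70 - 0.80) ** 2 + 0.01      # (mu - lambda)^2 + sigma^2(pi)
print(abs(estimate - predicted) < 1e-3)    # True
```

The item and residual components do not appear in the predicted value because the two item samples are independent; only the squared distance of the mean from the cutting score and the person variance survive.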

Similarly, the maximum expected agreement is found by using Equation 27 and the definition of maximum expected agreement in Equation 6:

$$A_m(d) = \mathcal{E}\,(S_{vI} - \lambda)^2 = (\mu - \lambda)^2 + \sigma^2(\pi) + \frac{\sigma^2(\beta)}{n} + \frac{\sigma^2(\pi\beta,e)}{n}, \tag{32}$$

where n is the number of items sampled for each instance of the testing procedure. Substituting Equations 31 and 32 in Equation 5, the index of dependability for mastery decisions is given by:

$$\theta(d) = \frac{(\mu - \lambda)^2 + \sigma^2(\pi)}{(\mu - \lambda)^2 + \sigma^2(\pi) + \dfrac{\sigma^2(\beta)}{n} + \dfrac{\sigma^2(\pi\beta,e)}{n}}. \tag{33}$$



Equations for estimating $\theta(d)$ have been discussed by Brennan and Kane (1977a). The constant, n, appears in Equations 32 and 33 because the observed scores are assumed to be averages over n items.

It is clear from Equation 33 that $\theta(d)$ will tend to be large when $(\mu - \lambda)^2$ is large (i.e., when the population mean is very different from the cutting score) even if $\sigma^2(\pi)$ is zero. If all examinees have the same universe score, $\sigma^2(\pi)$ is zero, and $(\mu - \lambda)^2$ provides a measure of the strength of the signal that needs to be detected for accurate classification (see Brennan and Kane, 1977b). If this signal is large, the required decisions are easy to make, and it is possible in such cases to classify examinees dependably, even if the test being used does not provide dependable information about individual differences among universe scores.
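A short calculation shows how the pieces of Equation 33 interact. The function below takes population values of the grand mean, the cutting score, and the three variance components as given; it is a sketch of the formula, not one of the estimation procedures cited, and all numerical values are hypothetical.

```python
def theta_d(mu, cut, var_p, var_i, var_pie, n):
    """theta(d) = [(mu - cut)^2 + var_p] / [(mu - cut)^2 + var_p + (var_i + var_pie) / n]."""
    signal = (mu - cut) ** 2 + var_p     # expected agreement A(d)
    error = (var_i + var_pie) / n        # item and residual components, averaged over n items
    return signal / (signal + error)


# Hypothetical components for proportion-correct scores.
print(round(theta_d(mu=0.70, cut=0.80, var_p=0.01, var_i=0.02, var_pie=0.18, n=20), 3))  # 0.667
print(round(theta_d(mu=0.70, cut=0.80, var_p=0.01, var_i=0.02, var_pie=0.18, n=40), 3))  # 0.8

# No person variance at all, but the mean is far from the cutting score.
print(round(theta_d(mu=0.40, cut=0.80, var_p=0.0, var_i=0.02, var_pie=0.18, n=20), 3))   # 0.941
```

The last line illustrates the point made in the text: with every examinee at the same universe score, the index is still high because the distance between the mean and the cutting score dominates the error term.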

The Index $\theta_c(d)$

Using the domain-referenced agreement function in Equation 27 and the definition of chance agreement in Equation 8, the expected agreement due to chance is:

$$A_c(d) = \mathcal{E}\,\big[(S_{vI} - \lambda)(S_{wJ} - \lambda)\big].$$

Replacing $S_{vI}$ and $S_{wJ}$ from Equations 25 and 26, and taking the expected value over persons and over the item samples I and J, the expected chance agreement for the domain-referenced agreement function is:

$$A_c(d) = (\mu - \lambda)^2. \tag{34}$$

Subtracting $A_c(d)$ from the numerator and denominator of Equation 33, the domain-referenced index of dependability, corrected for chance agreement, is:

$$\theta_c(d) = \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \dfrac{\sigma^2(\beta)}{n} + \dfrac{\sigma^2(\pi\beta,e)}{n}}. \tag{35}$$

The estimation of this index is discussed by Brennan and Kane (1977b), and its relationship to KR-21 is discussed by Brennan (1977b).
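The chance-corrected index depends only on the variance components and the number of items, not on the cutting score. The sketch below evaluates Equation 35 for hypothetical population values; the function name and the numbers are assumptions for illustration, and no estimation procedure is implied.

```python
def theta_c_d(var_p, var_i, var_pie, n):
    """theta_c(d) = var_p / (var_p + (var_i + var_pie) / n)."""
    error = (var_i + var_pie) / n   # item and residual components averaged over n items
    return var_p / (var_p + error)


# Hypothetical components, 20 items per test.
print(round(theta_c_d(var_p=0.01, var_i=0.02, var_pie=0.18, n=20), 3))  # 0.5

# With no variability among universe scores the corrected index is zero,
# however far the population mean lies from the cutting score.
print(theta_c_d(var_p=0.0, var_i=0.02, var_pie=0.18, n=20))  # 0.0
```

Because $(\mu - \lambda)^2$ cancels out of both numerator and denominator, moving the cutting score changes the uncorrected index but leaves this one untouched.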

Note that $\theta_c(d)$ is zero when $\sigma^2(\pi)$ is zero. If the test is to provide more dependable classification of examinees than could be achieved by chance, it must differentiate among the examinees. Therefore, some variability in universe scores is required if the test is to make a contribution to the dependability of the decision procedure.

Domain-Referenced Loss and $\sigma^2(\Delta)$

For the dpmain-ref erenced agreement function, the expected loss can be' 
found by subtracting Equation 31 from Equation 32: 



L(d) = A (d) - A(d) 
-tn — — 

0^(0) o^{jT6,e) 
^ — + w . (36) 



n in * ^ 

The loss, L(d), is therefore equal to the error a^{lx), which is discussed by 
Cronbach et al (1972), Brennan (1977a; 1977b) and Brennan and Kane <1977a, 
1977b). The error variance 'a^ (A) is appropriate for domain-referenced 
testing, m general ^^and for mas te'ry -testing, in particular, 

i 

In mastery testing, we are interested in "the degree to which the student
has attained criterion performance" (Glaser, 1963, p. 519), independent of the
performance of other students. That is, we are not primarily interested in
the relative ordering of examinees' universe scores; rather, we are interested
in the difference between each examinee's universe score and the absolute
standard defined by the mastery cutting score. In generalizability theory,
the universe score for examinee $v$ is, by definition,

$$\mu_v = \mathcal{E}_I\, S_{vI} = \mu + \pi_v,$$

where $S_{vI}$ is defined by Equation 25, and the expectation is taken over all
possible random samples of $n$ items from the universe of items. Therefore,
for a mastery test, the error for examinee $v$ is:

$$\begin{aligned}
\Delta_v &= (S_{vI} - \lambda) - (\mu_v - \lambda) \\
         &= S_{vI} - \mu_v \\
         &= [\mu + \pi_v + \beta_I + (\pi\beta,e)_{vI}] - [\mu + \pi_v] \\
         &= \beta_I + (\pi\beta,e)_{vI},
\end{aligned}$$

and the variance of $\Delta_v$ over persons and random samples of $n$ items is
$\sigma^2(\Delta)$, given by Equation 36.



When all students receive the same items, as implied by the linear model
in Equation 25, the main effect due to the sampling of items, $\beta_I$,
affects all examinees' observed scores in the same way. For mastery testing,
however, this does not eliminate the item effect as a source of error, because
our interest is in the absolute magnitude of an examinee's score, not the
magnitude relative to the scores of other examinees. For example, if we happen
to select an especially easy set of items from the universe, our estimates of
$\mu_v$ (for the universe of items) will tend to be too high for all examinees;
this error is accounted for by $\beta_I$.
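
The claim that the absolute error has variance $\sigma^2(\beta)/n + \sigma^2(\pi\beta,e)/n$ can be checked with a small simulation. The sketch below is only an illustration under added assumptions: item effects and residuals are drawn from normal distributions (the paper makes no distributional assumption), and the function name and variance values are hypothetical.

```python
import random
import statistics


def simulate_delta(var_i, var_res, n, reps=20000, seed=0):
    """Draw Delta = mean item effect + mean residual for many test instances."""
    rng = random.Random(seed)
    sd_i, sd_res = var_i ** 0.5, var_res ** 0.5
    deltas = []
    for _ in range(reps):
        # A fresh random sample of n items for this instance of the procedure.
        item_mean = sum(rng.gauss(0.0, sd_i) for _ in range(n)) / n
        # Person-by-item residuals for one examinee on those n items.
        res_mean = sum(rng.gauss(0.0, sd_res) for _ in range(n)) / n
        deltas.append(item_mean + res_mean)
    return deltas


# Hypothetical variance components, for illustration only.
deltas = simulate_delta(var_i=0.02, var_res=0.18, n=10)
expected = 0.02 / 10 + 0.18 / 10   # sigma^2(Delta) as in Equation 36
print(statistics.pvariance(deltas), expected)
```

The simulated variance should fall close to the value from Equation 36; the item term does not vanish, because a lucky (easy) or unlucky (hard) item sample shifts the estimate of every examinee's universe score.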
Interpretation of $\theta(d)$ and $\theta_c(d)$

The second block of Table 1 summarizes results for the parameters, $A(d)$,
$A_m(d)$, $A_c(d)$, and $L(d)$, and the agreement indices, $\theta(d)$ and
$\theta_c(d)$, for the domain-referenced agreement function, $d$.

The difference in interpretation between $\theta(d)$ and $\theta_c(d)$
parallels the difference between $\theta(t)$ and $\theta_c(t)$. The index
$\theta(d)$ characterizes the dependability of decisions or estimates based on
the testing procedures. The index $\theta_c(d)$ indicates the contribution of
the testing procedures to the dependability of these decisions or estimates.
It is clear from Equation 33 that $\theta(d)$ may be large even when there is
little or no universe score variability in the population of examinees. From
Equation 35, however, we see that $\theta_c(d)$ is equal to zero when there is
no universe score variability in the population [assuming
$\sigma^2(\Delta) > 0$].

Norm-referenced tests compare each examinee's score to the scores of other
examinees, and, therefore, require variability if these comparisons are to be
dependable. In their now classic paper, Popham and Husek (1969) maintained
that "variability is not a necessary condition for a good criterion-referenced
test" (p. 3). They argued that since criterion-referenced tests are "used to
ascertain an individual's status with respect to some criterion" (p. 2), the
meaning of the score is not dependent on comparison with other scores. Popham
and Husek conclude, therefore, that indices of dependability that require
variability are appropriate for norm-referenced tests but not for
criterion-referenced tests.

Although the position adopted by Popham and Husek seems plausible, it
leads to a very disturbing conclusion. As Woodson (1974a, p. 64) has pointed
out, "items and tests which give no variability give no information and are
therefore not useful." We are faced, therefore, with the apparent
contradiction, or paradox, that tests which provide no information about
differences among individual examinees can be good criterion-referenced tests.
In two subsequent articles, Millman and Popham (1974) and Woodson (1974b)
clarified the two sides in this dispute without resolving the basic issue.

The general framework developed here provides an obvious resolution of
this paradox. As we have seen, two types of coefficients can be developed
for any agreement function, depending upon whether or not one corrects for
chance agreement. Coefficients, such as $\theta(d)$, that are not corrected for
chance provide estimates of the dependability of the decision procedures;
and such coefficients may be large even without variability in universe
scores. By contrast, coefficients such as $\theta_c(d)$, that are corrected for
chance provide an estimate of the contribution of the test to the
dependability of the decision procedure. Such coefficients will approach zero
as the universe score variance approaches zero. Popham and Husek's argument
applies to the decision procedure, and coefficients not corrected for chance
are appropriate for characterizing the dependability of the decision procedure.
Woodson's argument applies to the contribution of the test to the decision
procedure, and coefficients corrected for chance are appropriate for
characterizing the contribution of the test to the dependability of the
decision procedure.

Domain-Referenced Agreement Without a Cutting Score

The domain-referenced agreement function in Equation 27 is the product
of deviations from a constant. The discussion up to this point has focused
on mastery testing, and $\lambda$ has been taken as the mastery cutting score.
However, a single domain-referenced test may be used for several different
decisions, involving different cutting scores. In such cases, it would be
useful to have an index of dependability that does not depend on a particular
cutting score. As discussed earlier, $\theta_c(d)$ is independent of
$\lambda$, and $\theta_c(d)$ is appropriate for assessing the contribution made
by the test to the dependability of mastery decisions using any cutting score.
Furthermore, $\theta_c(d)$ is less than or equal to $\theta(d)$ for all values
of $\lambda$, and the two are equal only when $\lambda = \mu$. Therefore,
$\theta_c(d)$ provides a lower bound for $\theta(d)$ (see Brennan, 1977b).

Moreover, domain-referenced tests do not necessarily involve any
consideration of cutting scores. For example, the score, $S_{vI}$, on a
domain-referenced test may be interpreted as a descriptive statistic which
estimates $\mu_v$, the examinee's universe score (i.e., percentage of items
that could be answered correctly) in the domain (see Millman and Popham, 1974).
When using domain-referenced scores as descriptive statistics, we are
interested in point estimates of the examinee's universe score, $\mu_v$. As we
have seen, the error (or noise) in such point estimates of universe scores is
given by $\Delta_v$, and $\theta_c(d)$ therefore incorporates the appropriate
error variance $\sigma^2(\Delta)$.

The universe score variance, $\sigma^2(\pi)$, in $\theta_c(d)$ provides a
measure of the dispersion of universe scores in the population. There is a
strong precedent in physical measurement for taking the variability in universe
scores as a measure of the magnitude of the signal to be detected.
General-purpose instruments for measuring length, for example, are typically
evaluated by their ability to detect differences of the order of magnitude of
those encountered in some area of practice. Thus, rulers are adequate in
carpentry, but verniers are necessary in machine shops.
Norm-Referenced Agreement

The agreement function that is implicit in generalizability coefficients
(see Cronbach et al., 1972, and Brennan, 1977a) is:

$$g(S_{vI}, S_{vJ}) = (S_{vI} - \mu_I)(S_{vJ} - \mu_J), \tag{37}$$

where $\mu_I$ is the expected value of $S_{vI}$ over the population of persons
for the set of items $I$; that is,

$$\mu_I = \mathcal{E}_v\, S_{vI} = \mu + \beta_I. \tag{38}$$

Similarly,

$$\mu_J = \mathcal{E}_v\, S_{vJ} = \mu + \beta_J. \tag{39}$$

The parameters, $\beta_I$ and $\beta_J$, are the average values of the item
effect for the two samples of items, and they reflect differences in difficulty
level from one randomly-selected instance of the testing procedure to another.

Note that the agreement function for norm-referenced tests, given in
Equation 37, and the agreement function for domain-referenced tests, given in
Equation 27, are both products of deviation scores. The difference between
the two agreement functions is in the nature of the deviation scores that
are used. The norm-referenced agreement function is defined in terms of
deviations from the population mean for fixed sets of items. These deviation
scores compare the examinee's performance on the set of items to the
performance of the population on the same set of items. The domain-referenced
agreement function in Equation 27 is defined in terms of the deviation of the
examinee's score from a fixed cutting score.

The Indices $\theta(g)$ and $\theta_c(g)$

Using the norm-referenced agreement function in Equation 37 and the
definition of expected agreement in Equation 2, we obtain

$$A(g) = \mathcal{E}_{v,I,J}\,(S_{vI} - \mu_I)(S_{vJ} - \mu_J), \tag{40}$$

and using Equation 25 to replace $S_{vI}$ and $S_{vJ}$ in Equation 40,

$$A(g) = \mathcal{E}_{v,I,J}\,[\pi_v + (\pi\beta,e)_{vI}]\,[\pi_v + (\pi\beta,e)_{vJ}] = \sigma^2(\pi). \tag{41}$$

Similarly, the maximum expected agreement is found by using the
norm-referenced agreement function and the definition of maximum expected
agreement in Equation 6:

$$A_m(g) = \mathcal{E}_{v,I}\,(S_{vI} - \mu_I)^2 = \sigma^2(\pi) + \frac{\sigma^2(\pi\beta,e)}{n}, \tag{42}$$

where $n$ is the number of items sampled for each instance of the testing
procedure. Substituting Equations 41 and 42 in Equation 5, an index of
dependability for norm-referenced tests is:

$$\theta(g) = \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \dfrac{\sigma^2(\pi\beta,e)}{n}}. \tag{43}$$
Using the norm-referenced agreement function in Equation 37 and the
definition of chance agreement in Equation 8, the expected agreement due
to chance is:

$$A_c(g) = \mathcal{E}\,[(S_{vI} - \mu_I)(S_{wJ} - \mu_J)]
         = \mathcal{E}\,[\pi_v + (\pi\beta,e)_{vI}]\,[\pi_w + (\pi\beta,e)_{wJ}].$$

Since all of the effects in this equation are assumed to be sampled
independently,

$$A_c(g) = 0,$$

and, therefore,

$$\theta_c(g) = \theta(g).$$

The correction for chance has no effect on the norm-referenced dependability
index, because a correction for chance is built into the norm-referenced
agreement function in Equation 37.

. : ' • / 

Norm-Referenced Loss and $\sigma^2(\delta)$

The loss associated with the norm-referenced agreement function is found
by subtracting Equation 41 from Equation 42:

$$L(g) = A_m(g) - A(g) = \frac{\sigma^2(\pi\beta,e)}{n}. \tag{46}$$

This loss is simply the error variance designated by Cronbach et al. (1972)
as $\sigma^2(\delta)$, which is also the error variance in classical test
theory.



In norm-referenced testing, we are interested in "the relative ordering
of individuals with respect to their test performance, for example, whether
student A can solve his problems more quickly than student B" (Glaser, 1963,
p. 519). Thus, our interest is in "the adequacy of the measuring procedure
for making comparative decisions" (Cronbach et al., 1972, p. 95). In this
situation, the error for a given person, as defined by Cronbach et al. (1972),
is

$$\begin{aligned}
\delta_v &= (S_{vI} - \mu_I) - (\mu_v - \mu) \\
         &= [\mu + \pi_v + \beta_I + (\pi\beta,e)_{vI} - \mu - \beta_I] - [\mu + \pi_v - \mu] \\
         &= (\pi\beta,e)_{vI}.
\end{aligned}$$

The variance of $\delta_v$ over the population of persons and samples of items
is

$$\sigma^2(\delta) = \frac{\sigma^2(\pi\beta,e)}{n} = L(g). \tag{47}$$



From Equations 14 and 45,

$$\theta(g) = \theta_c(g) = \frac{A(g)}{A(g) + L(g)},$$

and substituting Equations 41 and 47,

$$\theta(g) = \theta_c(g) = \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \dfrac{\sigma^2(\pi\beta,e)}{n}}, \tag{48}$$

which is identical to the generalizability coefficient $\mathcal{E}\rho^2$
given the random effects linear model in Equation 25. [Equation 48 is also
equivalent to Cronbach's (1951) coefficient alpha and to KR-20 for
dichotomously scored items.]



Interpretation of $\theta(g) = \theta_c(g)$

The third block of Table 1 summarizes results for the parameters, $A(g)$,
$A_m(g)$, $A_c(g)$, and $L(g)$, and the agreement indices, $\theta(g)$ and
$\theta_c(g)$, for the norm-referenced agreement function, $g$.

Equations 43 and 48 can also be interpreted as an intraclass correlation
coefficient, and, as such, they are approximately equal to the expected
correlation between random instances of the testing procedure (i.e.,
independent random samples of $n$ items). Estimation procedures for
generalizability coefficients are discussed by Cronbach et al. (1972), and by
Brennan (1977a).

From Equations 35 and 43 (or 48), note that $\theta_c(d)$ and $\theta_c(g)$
incorporate the same expected agreement (or signal) but different definitions
of error variance (loss or noise). For $\theta_c(d)$ the error variance is
$\sigma^2(\Delta)$, and for $\theta_c(g)$ the error variance is
$\sigma^2(\delta)$. It follows that

$$\theta_c(d) \le \theta_c(g)$$

because

$$\sigma^2(\Delta) \ge \sigma^2(\delta).$$

The difference between $\sigma^2(\Delta)$ and $\sigma^2(\delta)$ is simply
$\sigma^2(\beta)/n$. Therefore, $\theta_c(d)$ and $\theta_c(g)$ are equal only
when $\beta_I$ is a constant for all instances of the testing procedure. The
variance component for the main effect for items, $\sigma^2(\beta)$, reflects
differences in the mean score (in the population) for different samples of
items. If we are interested only in differences among examinee universe
scores, as in norm-referenced testing, then any effect which is a constant for
all examinees does not contribute to the error variance. However, for
domain-referenced testing, we are interested in the absolute magnitude of
examinee universe scores, or the magnitude compared to some externally defined
cutting score. In this case, fluctuations in mean scores for samples of items
do contribute to error variance.
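
The inequality between the two indices can be seen numerically. The following Python sketch evaluates Equations 35 and 43 for a few item variances; the function names and the variance components are hypothetical values chosen for illustration.

```python
def theta_c_d(var_p, var_i, var_res, n):
    """Chance-corrected domain-referenced index (Equation 35)."""
    return var_p / (var_p + var_i / n + var_res / n)


def theta_g(var_p, var_res, n):
    """Norm-referenced index, the generalizability coefficient (Equation 43)."""
    return var_p / (var_p + var_res / n)


# Hypothetical variance components; only the item variance is varied.
var_p, var_res, n = 0.03, 0.18, 10
for var_i in (0.0, 0.01, 0.05):
    print(var_i,
          round(theta_c_d(var_p, var_i, var_res, n), 3),
          round(theta_g(var_p, var_res, n), 3))
```

The two indices coincide when the item variance is zero, and the domain-referenced index falls further below the norm-referenced index as the item variance grows.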

The Effect of Item Sampling on the Indices of Dependability $\theta$ and $\theta_c$

Items Nested within Persons in the D Study

We have examined the implications of using several definitions of
agreement for randomly parallel tests. We have assumed that, for each instance
of the testing procedure, a random sample of items from some infinite domain is
administered to all examinees; i.e., items are crossed with examinees.
Following Cronbach et al. (1972), this design is designated $p \times i$.
Indices that are appropriate for other designs can be derived using the
approach discussed above. A particularly interesting and useful set of indices
is obtained by assuming that an independent random sample of items is selected
for each examinee. Following Cronbach et al. (1972), this design is designated
$i:p$, where the colon means "nested within."
In this section it will be convenient to make use of the distinction
between a G study and a D study, a distinction originally drawn by Rajaratnam
(1960) and subsequently discussed extensively by Cronbach et al. (1972).
The purpose of a G study, or generalizability study, is to examine the
dependability of some measurement procedure. The purpose of a D study, or
decision study, is to provide the data for making substantive decisions. "For
example, the published estimates of reliability for a college aptitude test
are based on a G study. College personnel officers employ these estimates to
judge the accuracy of data they collect on their applicants (D study)"
(Cronbach et al., 1972, p. 16). The principal results of a G study are
estimates of variance components, which can then be used in a variety of D
studies. The G study and the D study may use the same design or different
designs. Generally, G studies are most useful when they employ crossed designs
and large sample sizes to provide stable estimates of as many variance
components as possible.
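
To make the role of the G study concrete, the sketch below estimates the three variance components of the crossed $p \times i$ design from a persons-by-items score table. It uses the usual random-effects ANOVA estimators obtained from expected mean squares, which is an assumption on our part rather than a procedure spelled out in this section (the cited sources give the full treatment). The function name and the tiny data matrix are hypothetical.

```python
def g_study_p_by_i(scores):
    """Estimate variance components from a crossed persons-by-items table.

    scores[v][i] is the score of person v on item i (one observation per cell).
    Returns estimates of sigma^2(pi), sigma^2(beta) and sigma^2(pi beta, e).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_tot - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations of the random-effects model.
    return {
        "person": (ms_p - ms_res) / n_i,
        "item": (ms_i - ms_res) / n_p,
        "residual": ms_res,
    }


# Hypothetical 4 persons x 3 items table of right/wrong scores.
components = g_study_p_by_i([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
print(components)
```

Estimates obtained this way can be negative in small samples, which is one reason large crossed G studies are preferred; the resulting components can then be plugged into the index for whichever D study design is planned.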

In previous sections of this paper, we have implicitly assumed that both
the G study and the D study used the crossed design, $p \times i$. We will
continue to assume that variance components have been estimated from the
crossed design. However, in this section we will assume that the D study
employs the $i:p$ design. For example, in computer-assisted testing it is
frequently desirable (or even necessary for security reasons) that each
examinee receive a different set of items; i.e., the D study uses an $i:p$
design. However, even in such cases it is desirable that the variance
components be estimated from the crossed design, $p \times i$.

If in the D study each examinee gets a different set of items, the item
effect will not be the same for all examinees. Under these circumstances, the
linear model for scores on a particular instance of the testing procedure is:

$$S_{vI} = \mu + \pi_v + (\beta, \pi\beta, e)_{vI}, \tag{49}$$

and the item effect is now confounded with the residual, $(\pi\beta,e)$. It is
particularly important to note that, for Equation 49,

$$\mathcal{E}_v\, S_{vI} = \mu,$$

where $\mu$ is the grand mean in the population of persons and the universe of
items. The population mean of the observed scores, $\mathcal{E}_v S_{vI}$, does
not equal $\mu_I$, which is the expected value over the population for a
particular set of items, $I$. When items are nested within persons, taking the
expected value of the observed scores over the infinite population of examinees
implies taking the expected value over an infinite universe of items.



Implications for Norm-Referenced and Domain-Referenced Indices of Dependability

Using the linear model in Equation 49 and the norm-referenced agreement
function in Equation 37, it can be shown that:

$$A(g') = \sigma^2(\pi), \tag{50}$$

$$A_m(g') = \sigma^2(\pi) + \frac{\sigma^2(\beta)}{n} + \frac{\sigma^2(\pi\beta,e)}{n}, \tag{51}$$

and

$$A_c(g') = 0, \tag{52}$$

where the prime following $g$ differentiates quantities associated with the
nested design, $i:p$, from quantities associated with the crossed design,
$p \times i$.

Substituting these results in Equations 5 and 12, we obtain

$$\theta(g') = \theta_c(g') = \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \dfrac{\sigma^2(\beta)}{n} + \dfrac{\sigma^2(\pi\beta,e)}{n}} \tag{53}$$

$$= \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \sigma^2(\Delta)}. \tag{54}$$

Note that both $\theta(g')$ and $\theta_c(g')$ are identical to $\theta_c(d)$,
the domain-referenced dependability index, corrected for chance, in Equation
35. The only difference between $\theta(g')$ and the usual dependability index
for norm-referenced tests, $\theta(g)$, is that $\theta(g')$ has an additional
term, $\sigma^2(\beta)/n$, in the denominator.

For norm-referenced tests, when the same items are administered to all
examinees, the item effect, $\beta_I$, is a constant for all examinees, and
$\sigma^2(\beta)/n$ does not enter the error variance. If items are nested
within examinees, however, $\beta_I$ will generally be different for each
examinee, and $\sigma^2(\beta)/n$ is part of the error variance.

For the domain-referenced agreement function, the agreement indices
developed from the nested model are identical to those developed from the
crossed model:

$$\theta(d') = \theta(d).$$
The dependability of a domain-referenced testing procedure is not affected
by whether the D study uses the crossed design, $p \times i$, or the nested
design, $i:p$. The aim of domain-referenced testing is to provide point
estimates of examinee universe scores, rather than to make comparisons among
examinees. The dependability of each examinee's score is determined by the
number of items administered to that examinee, not by how many items or which
items are administered to other examinees.
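
The contrast between the crossed and nested D study designs can be summarized in a short calculation. The sketch below follows Equations 33, 43, and 53; the function name, the `nested` flag, and the numerical inputs are hypothetical and serve only as an illustration.

```python
def indices(mu, lam, var_p, var_i, var_res, n, nested=False):
    """Return theta(g) and theta(d) for a crossed (p x i) or nested (i:p) D study."""
    abs_error = var_i / n + var_res / n   # sigma^2(Delta)
    rel_error = var_res / n               # sigma^2(delta) for the crossed design
    # With items nested within persons, the item effect differs across
    # examinees, so it also enters the norm-referenced error variance.
    norm_error = abs_error if nested else rel_error
    signal_d = (mu - lam) ** 2 + var_p
    return {
        "theta_g": var_p / (var_p + norm_error),
        "theta_d": signal_d / (signal_d + abs_error),
    }


# Hypothetical values for illustration.
crossed = indices(0.6, 0.7, 0.03, 0.02, 0.18, 10)
nested = indices(0.6, 0.7, 0.03, 0.02, 0.18, 10, nested=True)
print(crossed, nested)
```

Under these assumptions the domain-referenced index is the same for both designs, while the norm-referenced index drops when each examinee receives a different item sample.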

Standardization of the items used in any instance of the testing
procedure improves the dependability of norm-referenced interpretations but
does not improve the dependability of domain-referenced interpretations.
Furthermore, the use of different samples of items for different examinees
will tend to improve estimates of group means. If, therefore,
domain-referenced tests are to be used for program evaluation, the selection of
independent samples of items for different examinees provides more dependable
estimates of group means without any loss in the dependability of estimates
of examinees' universe scores.
Summary and Conclusions

Table 1 provides an overview of the major results derived and discussed
in this paper. Indices of dependability, $\theta$ and $\theta_c$, are discussed
for three different agreement functions: the threshold agreement function,
$t$, the domain-referenced agreement function, $d$, and the norm-referenced
agreement function, $g$. This paper emphasizes considerations relevant to
the first two agreement functions, because the indices of dependability
associated with them are indices that have been proposed for domain-referenced
and mastery tests. The norm-referenced agreement function is considered
primarily for purposes of comparing it with the other two agreement functions.
The main purposes of this generalized treatment of indices of dependability
are to provide an internally consistent framework for deriving indices of
dependability for domain-referenced tests, and to examine the implications
of choosing a particular index.

Choosing an Index of Dependability

Our discussion of these issues has not dictated which index an evaluator
should choose in a particular context, but our discussion has indicated that
two main issues are involved in such a choice: (a) the nature of agreement
functions (or, alternatively, loss functions), and (b) the use of an index
corrected for chance or not corrected for chance.

With respect to the first issue, two types of agreement functions have
been considered for mastery tests: the threshold agreement function (Equation
16 or 17) and the domain-referenced agreement function (Equation 27). The
threshold agreement function is appropriate whenever the only distinction
that can be made usefully is a qualitative distinction between masters and
non-masters. If, however, different degrees of mastery and non-mastery
exist to an appreciable extent, the threshold agreement function is not
appropriate because it ignores such differences.

In most educational contexts, differences between masters and non-masters
are not purely qualitative. Rather, the attribute that is measured is
conceptualized as an ordinal or interval scale, and the examinees may
possess the attribute to varying degrees even though a single cutting score
is used to define mastery. In this context it is important that examinees
who are far above or below the cutting score be classified correctly. The
misclassification of such examinees is likely to cause serious losses. The
misclassification of examinees whose level of ability is close to the cutting
score will involve much less serious losses. Current techniques for setting
the cutting score are not very precise, and the choice of a cutting score
is, to some extent, arbitrary. It is, therefore, relatively less important
that the testing procedure correctly classify examinees whose level of skill
is close to the specified cutting score.

The domain-referenced agreement function, $d$, in Equation 27 reflects
these considerations. It assigns a positive value to the agreement whenever
both instances of the testing procedure place the examinee in the same
category, and it assigns a negative value to the agreement when the two
instances place an examinee in different categories. Furthermore, the magnitude
of the agreement is determined by the distance of the observed scores from the
cutting score on the two instances of the procedure.
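
A short sketch may help contrast the two agreement functions for a single examinee tested twice. The function names are hypothetical, the convention that a score at or above the cutting score counts as mastery is an assumption made here, and the scores are invented for illustration.

```python
def threshold_agreement(s1, s2, lam):
    """1 if both observed scores fall on the same side of the cutting score, else 0.

    Assumes (hypothetically) that a score >= lam is classified as mastery.
    """
    return 1 if (s1 >= lam) == (s2 >= lam) else 0


def domain_agreement(s1, s2, lam):
    """Domain-referenced agreement: product of deviations from the cutting score."""
    return (s1 - lam) * (s2 - lam)


lam = 0.70  # hypothetical cutting score
for s1, s2 in [(0.95, 0.90), (0.72, 0.71), (0.72, 0.68), (0.95, 0.40)]:
    print(s1, s2, threshold_agreement(s1, s2, lam),
          round(domain_agreement(s1, s2, lam), 4))
```

The threshold function treats the near-miss pair (0.72, 0.68) and the gross disagreement (0.95, 0.40) identically, whereas the domain-referenced function assigns the latter a far larger negative value.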

The second issue in choosing an index of dependability is whether to use
the index $\theta$, which is not corrected for chance agreement, or the index
$\theta_c$, which is corrected for chance. There is no reason to prefer one
index over the other in all contexts. The two indices provide different
information, and, therefore, should be interpreted differently. For
judgements about the dependability of a decision procedure, as applied to a
particular population, indices that are not corrected for chance are more
appropriate. For judgements about the contribution of tests to the
dependability of the decision procedure, indices that are corrected for chance
are more appropriate. Subkoviak (Note 3) makes similar statements in his
response to Huynh's (Note 5) criticism of coefficient kappa [$\theta_c(t)$
given by Equation 23].

It is also useful to note that whether one chooses $\theta$ or $\theta_c$,
the expected loss or error variance remains unchanged. That is, the choice
between $\theta$ and $\theta_c$ usually affects the strength of the signal in a
testing procedure, but never the strength of the noise (see Brennan and Kane,
1977b). In effect, when one chooses $\theta_c$, the strength of the signal is
reduced by an amount attributable to chance, and it is this reduction of signal
strength that causes $\theta_c$ to be less than $\theta$, usually. As noted
previously, for the norm-referenced agreement function, $g$, $\theta_c$ always
equals $\theta$ because chance agreement is zero. Indeed, this is probably the
reason why the distinction between indices such as $\theta$ and $\theta_c$ has
been ignored in much of the literature on testing and psychometrics.
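
The signal and noise reading of the two indices can be expressed in a few lines. The sketch below takes the expected agreement $A$, the maximum agreement $A_m$, and the chance agreement $A_c$ as inputs and forms the uncorrected index as $A/A_m$ and the corrected index by subtracting $A_c$ from numerator and denominator, as done for Equation 35. The function name and numerical values are hypothetical.

```python
def agreement_indices(A, A_m, A_c):
    """General indices from expected (A), maximum (A_m) and chance (A_c) agreement."""
    loss = A_m - A                      # expected loss, the "noise"
    theta = A / A_m                     # not corrected for chance
    theta_c = (A - A_c) / (A_m - A_c)   # corrected for chance
    return theta, theta_c, loss


# Hypothetical domain-referenced case: (mu - lam)^2 = 0.04, sigma^2(pi) = 0.03,
# sigma^2(Delta) = 0.02.
print(agreement_indices(A=0.07, A_m=0.09, A_c=0.04))

# Hypothetical norm-referenced case: chance agreement is zero.
print(agreement_indices(A=0.03, A_m=0.048, A_c=0.0))
```

In the first call the correction lowers the index while the loss is untouched; in the second the two indices coincide because there is no chance agreement to remove.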

Prior Information

For the domain-referenced agreement function, $d$, $\theta_c(d)$ equals
$\theta(d)$ when $(\mu - \lambda)^2$ equals zero, i.e., when the mean, $\mu$,
equals the cutting score, $\lambda$. In such cases, prior information about
$\mu$ is of no use in classifying examinees as masters or non-masters, and the
dependability of decisions depends entirely upon the dependability of the test
being used. If $(\mu - \lambda)^2$ is very large, decisions made about a
student's mastery or non-mastery status, solely on the basis of prior
information about $\mu$, may be highly dependable. If, however,
$(\mu - \lambda)^2$ is non-zero but not very large compared to the expected
loss, $\sigma^2(\Delta)$, it is likely that the dependability of decisions
could be improved by using Bayesian methods.

Bayesian procedures (Hambleton and Novick, 1973; Swaminathan, Hambleton,
and Algina, 1975) take advantage of prior information about the population by
using this information and the student's observed score to estimate the
student's universe score. The optimum weighting of prior information and test
scores depends on the prior distribution of universe scores in the population,
the dependability of the testing procedure, and the agreement function (or
equivalently, the loss function) that is chosen. Although the published
applications of Bayesian methods have used a threshold loss, these methods
are, in principle, equally applicable for the domain-referenced loss, $L(d)$.

Assumptions about Parallel Tests

Throughout this paper, we have assumed that two tests are parallel if
they involve random samples of the same number of items from the same universe,
or domain, of items. That is, we have made the assumption of randomly
parallel tests, rather than the stronger assumption of classically parallel
tests. Cronbach et al. (1972) have shown that either assumption can be used
as a basis for defining the generalizability coefficient for the persons
crossed with items design; and we have shown that this generalizability
coefficient is identical to $\theta(g) = \theta_c(g)$ for norm-referenced
tests. Also, either assumption, in conjunction with the threshold agreement
function, can be used to derive the indices $\theta(t)$ and $\theta_c(t)$. It
is interesting to note, however, that the Huynh (1976) and Subkoviak (1976)
procedures for estimating $\theta(t)$ and $\theta_c(t)$ necessitate the
assumption of classically parallel tests.

We have argued that the assumption of classically parallel tests is
generally inappropriate for a domain-referenced test because, for a
domain-referenced test, our interest is focused on an examinee's universe score
without regard to the scores of other examinees. However, if all items in
the universe are equally difficult for the population of persons, then the
item effect, $\beta_i$, in Equation 24 is a constant for all items, and
$\sigma^2(\Delta)$ equals $\sigma^2(\delta)$. That is, the expected loss for
the domain-referenced agreement function equals the expected loss for the
norm-referenced agreement function. In this case, the index $\theta(d)$ in
Equation 33 is identical to Livingston's (1972a, 1972b, 1972c, 1973)
coefficient.

The differences between $\theta(d)$ and Livingston's coefficient are,
therefore, a direct result of the differences between the assumptions of
randomly parallel tests and classically parallel tests, respectively. It is
important to note, however, that neither index is corrected for chance. They
both reflect the dependability of a decision procedure, not the contribution of
tests to the dependability of a decision procedure. Also, for both
coefficients, changes in the cutting score, $\lambda$, affect the coefficients'
magnitudes through the signal strength, not through the noise or error
variance.




Concluding Comments

Throughout this paper we have concentrated upon indices of dependability
for domain-referenced tests, and factors that influence the use and
interpretation of such indices. We have particularly emphasized the indices
$\theta(d)$ and $\theta_c(d)$ because they have broad applicability in
domain-referenced testing, they are easily compared with the usual
norm-referenced indices of dependability, and they can be developed using
principles from generalizability theory, a broadly applicable psychometric
model. Using principles from generalizability theory, it is relatively
straightforward to define $\theta(d)$ and $\theta_c(d)$ for ANOVA designs other
than the simple persons-crossed-with-items design. (See, for example, our
treatment of the items nested within persons design.) The extension of
$\theta(t)$ and $\theta_c(t)$ to other designs is not so straightforward.

However, no matter which index of dependability an evaluator chooses,
it is important that the evaluator recognize the underlying assumptions and
interpret results in a meaningful manner. In this regard, it is often the
case that the magnitude of an index of dependability, alone, provides an
insufficient basis for decision-making. It is almost always best to provide,
also, the quantities that enter the index ($A$, $A_m$, $A_c$, and $L$ in
Table 1), as well as the estimated variance components (see APA, 1974).
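Table 1

Coefficients for Different Agreement Functions

Threshold
    Agreement function:      $t = 1$ if the two instances of the testing procedure
                             place the examinee in the same category; $t = 0$ otherwise
    Parameters:              $A(t) = \sum_i p_{ii}$
                             $A_m(t) = 1$
                             $A_c(t) = \sum_i p_i^2$
                             $L(t) = \sum_{i \ne j} p_{ij}$
    Agreement coefficients:  $\theta(t) = \sum_i p_{ii}$
                             $\theta_c(t) = \dfrac{\sum_i p_{ii} - \sum_i p_i^2}{1 - \sum_i p_i^2}$

Domain-Referenced
    Agreement function:      $d(S_{vI}, S_{vJ}) = (S_{vI} - \lambda)(S_{vJ} - \lambda)$
    Parameters:              $A(d) = (\mu - \lambda)^2 + \sigma^2(\pi)$
                             $A_m(d) = (\mu - \lambda)^2 + \sigma^2(\pi) + \sigma^2(\Delta)$
                             $A_c(d) = (\mu - \lambda)^2$
                             $L(d) = \sigma^2(\Delta)$
    Agreement coefficients:  $\theta(d) = \dfrac{(\mu - \lambda)^2 + \sigma^2(\pi)}{(\mu - \lambda)^2 + \sigma^2(\pi) + \sigma^2(\Delta)}$
                             $\theta_c(d) = \dfrac{\sigma^2(\pi)}{\sigma^2(\pi) + \sigma^2(\Delta)}$

Norm-Referenced
    Agreement function:      $g(S_{vI}, S_{vJ}) = (S_{vI} - \mu_I)(S_{vJ} - \mu_J)$
    Parameters:              $A(g) = \sigma^2(\pi)$
                             $A_m(g) = \sigma^2(\pi) + \sigma^2(\delta)$
                             $A_c(g) = 0$
                             $L(g) = \sigma^2(\delta)$
    Agreement coefficients:  $\theta(g) = \theta_c(g) = \dfrac{\sigma^2(\pi)}{\sigma^2(\pi) + \sigma^2(\delta)}$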